Abstract

A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 
genes and a small labeled sample of fewer than 100 points. Some classification rule is used to design the classifier from
this data, but we are given no good reason or conditions under which this algorithm should perform well. An error 
estimation rule is used to estimate the classification error on the population using the same data, but once again we 
are given no good reason or conditions under which this error estimator should produce a good estimate, and thus 
we do not know how well the classifier should be expected to perform. In fact, in virtually all such papers the error
estimate is expected to be highly inaccurate. In short, we are given no justification for any claims. 
Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude 
that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that 
scientific knowledge is impossible in regard to their content; rather, it is that they utilize methods that preclude 
scientific knowledge. In this paper, we argue to the contrary that scientific knowledge in small-sample classification is 
possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for 
pattern recognition in which we incorporate prior knowledge in the whole classification procedure (classifier design 
and error estimation), optimize each step of the procedure given available information, and obtain theoretical 
measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In 
sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate. 



Review 

Introduction 

It is implicit in the title of this paper that one can enter- 
tain the possibility that scientific knowledge is impossible 
with small-sample classification. In fact, not only might 
one entertain this impossibility, but perusal of the related 
literature would most likely lead one to seriously consider 
that impossibility. It is not that thousands of papers overtly 
claim that scientific knowledge is impossible with regard
to their content; rather, it is that they utilize methods 
that, ipso facto, cannot lead to knowledge. Even though it 
appears to be almost universally, if tacitly, assumed that 
scientific knowledge is impossible with small-sample clas- 
sification - otherwise, why do so many not aspire to such 
knowledge - we argue to the contrary in this paper that
scientific knowledge is possible. But before we make our
case, let us examine in more detail why the literature may 
lead one to believe otherwise. 

Consider the following common motif for a small- 
sample-classification paper, for instance, one proposing a 
classifier based on gene expression to discriminate types 
of pathology, stages of a disease, duration of survival, 
or some other phenotypic difference. Beginning with 
30,000 features (genes) and fewer than 100 labeled sample
points (microarrays), some classification rule (algorithm) 
is selected, perhaps an old one or a new one proposed in 
the paper. We are given no good reason why this algorithm 
should perform well. The classification rule is applied to 
the data and, using the same data, an error estimation 
rule is used to estimate the classification error on the 
population, meaning in practice the error rate on future 
observations. Once again, we are given no good reason
why this error estimator should produce a good estimate;
in fact, in virtually all such papers, from what we
know about the error estimation rule we would expect
the estimate to be inaccurate. At this point, one of two
claims is made. If the classification rule is a well-known 
rule and the purpose of the paper is to produce a classi- 
fier for application (say, a biomarker panel), we are told 
that the authors have achieved their goal of finding such 
a classifier and its accuracy is validated by the error esti- 
mate. If, on the other hand, the purpose is to devise a new 
classification rule, we are told that the efficacy of the new 
rule has been validated by its performance, as measured 
by the error estimate or by several such error estimates on
several different data sets. In either case, we are given no 
justification for the validation claim. Moreover, in the sec- 
ond case, we are not told the conditions under which the 
classification rule should be expected to perform well or 
how well it should be expected to perform. 

Amid all of this vacuity, perhaps the reporting of error 
estimates whose accuracy is a complete mystery is the 
most puzzling from a scientific perspective. To borrow a 
metaphor [1], one can imagine Harald Cramér leisurely
sailing on the Baltic off the coast of Stockholm, taking 
in the sights and sounds of the sea, when suddenly a 
gene-expression classifier to detect prostate cancer pops 
into his head. No classification rule has been applied, 
nor is that necessary. All that matters is that Cramér's
imagination has produced a classifier that operates on 
the feature-label distribution of interest with a sufficiently 
small error rate. Since scientific validity depends on the 
predictive capacity of a model, while an appropriate clas- 
sification rule is certainly beneficial to classifier design, 
epistemologically, the error rate is paramount. Were we to 
know the feature-label distribution of interest, we could 
exactly determine the error rate of the proposed classi- 
fier. Absent knowledge of the feature-label distribution, 
the actual error must be estimated from data and the 
accuracy of the estimate judged from the performance 
of the error estimation rule employed. Consequently, any 
paper that applies an error estimation rule without pro- 
viding a performance characterization relevant to the data 
at hand is scientifically vacuous. Given the near univer- 
sality of vacuous small-sample classification papers in the 
literature, one could easily reach the conclusion that sci- 
entific knowledge is impossible in small-sample settings. 
Of course, this would raise the question of why people are
writing vacuous papers and why journals are publishing 
them. Since the latter are sociological questions, they are 
outside the domain of the current paper. We will focus on 
the scientific issues. 

Epistemological digression 

Before proceeding, we digress momentarily for some very 
brief comments regarding scientific epistemology (refer- 
ring to [2] for a comprehensive treatise and to [3] for a 
discussion aimed at biology and including classifier validity).
Our aim is narrow, simply to emphasize the role of
prediction in scientific knowledge, not to indulge in broad
philosophical issues.

A scientific theory consists of two parts: (1) a mathe- 
matical model composed of symbols (variables and rela- 
tions between the variables), and (2) a set of operational 
definitions that relate the symbols to data. A mathematical 
model alone does not constitute a scientific theory. The 
formal mathematical structure must yield experimental 
predictions in accord with experimental observations. As 
put succinctly by Richard Feynman, "It is whether or not 
the theory gives predictions that agree with experiment. 
It is not a question of whether a theory is philosophically 
delightful, or easy to understand, or perfectly reasonable 
from the point of view of common sense" [4]. Model validity
is characterized by predictive relations, without which
the model lacks empirical content. Validation requires 
that the symbols be tied to observations by some semantic 
rules that relate not necessarily to the general principles 
of the mathematical model themselves but to conclu- 
sions drawn from the principles. There must be a clearly 
defined tie between the mathematical model and experi- 
mental methodology. Philipp Frank writes, "Reichenbach 
had explicitly pointed out that what is needed is a bridge 
between the symbolic system of axioms and the proto- 
cols of the laboratory. But the nature of this bridge had 
been only vaguely described. Bridgman was the first who 
said precisely that these relations of coordination con- 
sist in the description of physical operations. He called 
them, therefore, operational definitions" [5]. Elsewhere, 
we have written, "Operational definitions are required, 
but their exact formulation in a given circumstance is 
left open. Their specification constitutes an epistemo- 
logical issue that must be addressed in mathematical 
(including logical) statements. Absent such a specifica- 
tion, a purported scientific theory is meaningless" [6]. 

The validity of a scientific theory depends on the choice 
of validity criteria and the mathematical properties of 
those criteria. The observational measurements and the 
manner in which they are to be compared to the math- 
ematical model must be formally specified. The validity 
of a theory is relative to this specification, but what is 
not at issue is the necessity of a set of relations tying 
the model to operational measurements. Formal specifi- 
cation is mandatory and this necessarily takes the form 
of mathematical (including logical) statements. Formal 
specification is especially important in stochastic settings 
where experimental outcomes reflect the randomness of 
the stochastic system so that one must carefully define 
how the outcomes are to be interpreted. 

Storytelling and intuitive arguments cannot suffice.
Not only is complex-system behavior often unintuitive,
but stochastic processes and statistics often contradict
naive probabilistic notions gathered from simple experiments
like rolling dice. Perhaps even worse is an appeal
to pretty pictures drawn with computer software. The literature
abounds with data partitioned according to some
clustering algorithm whose partitioning performance is 
unknown or, even more strangely, justified by some 
"validation index" that is poorly, if at all, correlated with 
the error rate of the clustering algorithm [7]. The pretty 
pictures are usually multi-colored and augmented with all 
kinds of attractive-looking symbols. They are inevitably 
followed by some anecdotal commentary. Although all 
of this may be delightful, it is scientifically meaningless. 
Putting the artistic touches and enormous calculations 
aside, all we are presented with is a radical empiricism. 
Is there any knowledge here? Hans Reichenbach answers, 
"A mere report of relations observed in the past cannot 
be called knowledge. If knowledge is to reveal objec- 
tive relations of physical objects, it must include reliable 
predictions. A radical empiricism, therefore, denies the 
possibility of knowledge" [2]. A collection of measure- 
ments together with a commentary on the measurements 
is not scientific knowledge. Indeed, the entire approach 
"denies the possibility of knowledge," so that its adoption 
constitutes a declaration of meaninglessness. 

Classification error 

For two-class classification, the population is characterized
by a feature-label distribution $F$ for a random pair
$(X, Y)$, where $X$ is a vector of $D$ features and $Y$ is the binary
label, 0 or 1, of the class containing $X$. A classifier is a
function, $\psi$, which assigns a binary label, $\psi(X)$, to each
feature vector. The error, $\varepsilon[\psi]$, of $\psi$ is the probability,
$P(\psi(X) \neq Y)$, that $\psi$ yields an erroneous label. A classifier
with minimum error among all classifiers is known
as a Bayes classifier for the feature-label distribution.
The minimum error is called the Bayes error. Epistemologically,
the error is the key issue since it quantifies the
predictive capacity of the classifier.
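
As a small illustration of these definitions, the following sketch computes the error of a classifier and the Bayes error when the feature-label distribution is known and the feature takes finitely many values. The probability table `joint`, the classifier `psi`, and the function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def classifier_error(joint, labels):
    """Error P(psi(X) != Y) of a classifier on a known discrete distribution.

    joint[x, y] is the probability P(X = x, Y = y); labels[x] is psi(x).
    """
    joint = np.asarray(joint, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # At each feature value, the error mass is the probability of the label
    # that the classifier does NOT assign there.
    wrong = joint[np.arange(joint.shape[0]), 1 - labels]
    return float(wrong.sum())

def bayes_classifier(joint):
    """Assign each feature value its more probable label (ties go to class 0)."""
    joint = np.asarray(joint, dtype=float)
    return (joint[:, 1] > joint[:, 0]).astype(int)

# Hypothetical feature-label distribution over three feature values.
joint = np.array([[0.30, 0.05],
                  [0.10, 0.15],
                  [0.05, 0.35]])
psi = np.array([0, 0, 1])             # an arbitrary classifier
psi_bayes = bayes_classifier(joint)   # a Bayes classifier for this table

print("error of psi:", classifier_error(joint, psi))
print("Bayes error: ", classifier_error(joint, psi_bayes))
```

When the table is unknown, neither number can be computed, which is why the error must be estimated from data.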

Abstractly, any pair $\mathcal{M} = (\psi, \varepsilon_\psi)$ composed of a function
$\psi : \mathbb{R}^D \to \{0, 1\}$ and a real number $\varepsilon_\psi \in [0, 1]$
constitutes a classifier model, with $\varepsilon_\psi$ being simply a number,
not necessarily specifying an actual error probability
corresponding to $\psi$. $\mathcal{M}$ becomes a scientific model when
it is applied to a feature-label distribution. In practice,
the feature-label distribution is unknown and a classification
rule $\Psi_n$ is used to design a classifier $\psi_n$ from a
random sample $S_n = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ of
pairs drawn from the feature-label distribution. Note that
a classification rule is a sequence of rules depending on the
sample size $n$. If feature selection is involved, then it is part
of the classification rule. A designed classifier produces
a classifier model, namely, $(\psi_n, \varepsilon[\psi_n])$. Since the true
classifier error $\varepsilon[\psi_n]$ depends on the feature-label distribution,
which is unknown, $\varepsilon[\psi_n]$ is unknown. The true
error must be estimated by an estimation rule, $\Xi_n$. Thus,
the random sample $S_n$ yields a classifier $\psi_n = \Psi_n(S_n)$ and
an error estimate $\hat{\varepsilon}[\psi_n] = \Xi_n(S_n)$, which together constitute
a classifier model $(\psi_n, \hat{\varepsilon}[\psi_n])$. Overall, classifier
design involves a rule model $(\Psi_n, \Xi_n)$ used to determine
a sample-dependent classifier model $(\psi_n, \hat{\varepsilon}[\psi_n])$. Both
$(\psi_n, \varepsilon[\psi_n])$ and $(\psi_n, \hat{\varepsilon}[\psi_n])$ are random pairs relative to
the sampling distribution.

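Given a feature-label distribution, error estimation
accuracy is commonly measured by the mean-square error
(MSE), defined by $\mathrm{MSE}(\hat{\varepsilon}) = E[(\hat{\varepsilon} - \varepsilon)^2]$, where for notational
ease we denote $\varepsilon[\psi_n]$ and $\hat{\varepsilon}[\psi_n]$ by $\varepsilon$ and $\hat{\varepsilon}$, respectively,
or, equivalently, by the square root of the MSE,
known as the root-mean-square (RMS). The expectation
used here is relative to the sampling distribution induced
by the feature-label distribution. The MSE is decomposed
into the bias, $\mathrm{Bias}(\hat{\varepsilon}) = E[\hat{\varepsilon} - \varepsilon]$, of the error estimator
relative to the true error, and the deviation variance,
$\mathrm{Var}_{\mathrm{dev}}(\hat{\varepsilon}) = \mathrm{Var}(\hat{\varepsilon} - \varepsilon)$, by

$$\mathrm{MSE}(\hat{\varepsilon}) = \mathrm{Var}_{\mathrm{dev}}(\hat{\varepsilon}) + \mathrm{Bias}(\hat{\varepsilon})^2. \qquad (1)$$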

When a large amount of data is available, the sample can 
be split into independent training and test sets, the classi- 
fier being designed on the training data and its error being 
estimated by the proportion of errors on the test data, 
which is known as the holdout estimator. For holdout, we
have the distribution-free bound $\mathrm{RMS}(\hat{\varepsilon}_{\mathrm{holdout}} \mid S_{n-m}, F) \le 1/\sqrt{4m}$,
where $m$ is the size of the test sample, $S_{n-m}$ is
the training sample, and $F$ is any feature-label distribution
[8]. $\mathrm{RMS}(\hat{\varepsilon} \mid Z)$ indicates that the expectation in the RMS
is conditioned on the random vector $Z$. But when data are
limited, the sample cannot be split without leaving too lit- 
tle data to design a good classifier. Hence, training and 
error estimation must take place on the same data set. 
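
The holdout bound can be checked numerically. The sketch below relies on a standard fact rather than anything specific to the paper: once the training sample is fixed, the classifier and its true error are fixed, so the number of errors on m independent test points is binomial and the holdout estimate is unbiased. The function names are hypothetical.

```python
import math

def holdout_rms(true_error, m):
    """Exact conditional RMS of the holdout estimate for a fixed classifier.

    The error count on m independent test points is Binomial(m, true_error),
    so the estimate is unbiased and its RMS equals its standard deviation.
    """
    return math.sqrt(true_error * (1.0 - true_error) / m)

def holdout_bound(m):
    """Distribution-free bound 1 / sqrt(4m) on the conditional RMS."""
    return 1.0 / math.sqrt(4.0 * m)

for m in (25, 100, 400):
    # The worst case over a grid of true errors occurs at true_error = 0.5.
    worst = max(holdout_rms(e / 100.0, m) for e in range(101))
    print(m, worst, holdout_bound(m))
```

The bound is tight only when the true error is 0.5, and it shrinks as the test set grows, which is exactly what a small sample cannot afford.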

The consequences of training-set error estimation are 
readily explained by the following formula for the devia- 
tion variance: 

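$$\mathrm{Var}_{\mathrm{dev}}(\hat{\varepsilon}) = \sigma_{\hat{\varepsilon}}^2 + \sigma_{\varepsilon}^2 - 2\rho\,\sigma_{\hat{\varepsilon}}\,\sigma_{\varepsilon}, \qquad (2)$$

where $\sigma_{\hat{\varepsilon}}^2$, $\sigma_{\varepsilon}^2$, and $\rho$ are the variance of the error estimate,
the variance of the error, and the correlation between
the estimated and true errors, respectively. The deviation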
variance is driven down by small variances or a correlation 
coefficient near 1. 
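
The decompositions in (1) and (2) can be verified on any collection of paired true and estimated errors. The sketch below uses synthetic pairs generated from made-up parameters (they are assumptions, not results from any study) and checks both identities on the sample moments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (true error, estimated error) pairs; the numbers are synthetic
# and stand in for the output of a repeated-sampling experiment.
true_err = 0.25 + 0.03 * rng.standard_normal(10_000)
est_err = 0.22 + 0.08 * rng.standard_normal(10_000) + 0.2 * (true_err - 0.25)

diff = est_err - true_err
mse = np.mean(diff ** 2)
bias = np.mean(diff)
var_dev = np.var(diff)  # deviation variance (population form, ddof=0)

# Equation (1): MSE = deviation variance + squared bias.
assert np.isclose(mse, var_dev + bias ** 2)

# Equation (2): deviation variance from the two variances and the correlation.
sd_est, sd_true = np.std(est_err), np.std(true_err)
rho = np.corrcoef(est_err, true_err)[0, 1]
var_dev_from_eq2 = sd_est ** 2 + sd_true ** 2 - 2 * rho * sd_est * sd_true
assert np.isclose(var_dev, var_dev_from_eq2)

print(f"RMS = {np.sqrt(mse):.4f}, bias = {bias:.4f}, rho = {rho:.3f}")
```

With a weak correlation, the cross term removes little, so the deviation variance stays close to the sum of the two variances.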

Consider the popular cross-validation error estimator. 
For it, the error is estimated on the training data by ran- 
domly splitting the training data into k folds (subsets), 
$S_n^i$, for $i = 1, 2, \ldots, k$, training $k$ classifiers on $S_n - S_n^i$,
for i = 1,2, ...,k, calculating the proportion of errors of 
each designed classifier on the appropriate left-out fold, 
and then averaging these proportions to obtain the cross- 
validation estimate of the originally designed classifier. 
Various enhancements are made, such as by repeating 
the process some number of times and averaging. Letting 
k=n yields the leave-one-out estimator. The problem with 
cross-validation is evident from (2): for small samples,
it has large variance and little correlation with the true
error. Hence, although with small folds, cross-validation 
does not suffer too much from bias, it typically has large 
deviation variance. 
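
The procedure just described can be sketched in a few lines. The nearest-mean classification rule below is a deliberately simple stand-in chosen for brevity (it is not the rule used in the paper), the data are synthetic, and the sketch assumes every training subset contains points from both classes.

```python
import numpy as np

def nearest_mean_fit(X, y):
    """Return the two class means (a deliberately simple classification rule)."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def nearest_mean_predict(means, X):
    """Assign each point the label of the closer class mean."""
    d0 = np.linalg.norm(X - means[0], axis=1)
    d1 = np.linalg.norm(X - means[1], axis=1)
    return (d1 < d0).astype(int)

def cross_validation_error(X, y, k, rng):
    """k-fold cross-validation estimate; k = len(y) gives leave-one-out.

    Assumes each training subset still contains both classes.
    """
    folds = np.array_split(rng.permutation(len(y)), k)
    fold_errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        means = nearest_mean_fit(X[train], y[train])
        predictions = nearest_mean_predict(means, X[fold])
        fold_errors.append(np.mean(predictions != y[fold]))
    # Average the per-fold error proportions.
    return float(np.mean(fold_errors))

# Synthetic two-class sample with 15 points per class in two dimensions.
rng = np.random.default_rng(1)
n_per_class = 15
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, 2)),
               rng.normal(1.0, 1.0, (n_per_class, 2))])
y = np.repeat([0, 1], n_per_class)

print("5-fold estimate:       ", cross_validation_error(X, y, k=5, rng=rng))
print("leave-one-out estimate:", cross_validation_error(X, y, k=len(y), rng=rng))
```

Rerunning with different seeds gives a feel for how much the estimate moves from sample to sample at this sample size.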

To illustrate the matter, we reproduce an example from 
[9] based on real patient data from a study involving 
microarrays prepared with RNA from breast tumor spec- 
imens from 295 patients, 115 and 180 belonging to the 
good-prognosis and poor-prognosis classes, respectively. 
The dataset is reduced to the 2,000 genes with highest 
variance, these are reduced to 10 via t test feature selec- 
tion, and a classifier is designed using linear discriminant 
analysis (LDA). In the simulations, the data are split into 
two sets. The first set, consisting of 50 examples drawn 
without replacement from the full dataset, is used for both 
training and error estimation via leave-one-out cross- 
validation. The remaining examples are used as a hold-out 
test set to get an accurate estimate of the true error, which 
is taken as the true error. There is an assumption that such 
a hold-out size will give an accurate estimate of the true 
error. This procedure is repeated 10,000 times. Figure 1 
shows the scatter plot for the pairs of true and estimated 
errors, along with the linear regression of the true error 
on the estimated error. The means are shown on the 
axes. What we observe is typical for small samples: large 
variance and negligible regression between the true and 
estimated errors [10]. Indeed, one even sees negatively
sloping regression lines for cross-validation and bootstrap
(another resampling error estimator), and negative
correlation between the true and cross-validation estimated
errors has been mathematically demonstrated in
some basic models [11]. Such error estimates are worthless
and can lead to a huge waste of resources in trying to
reproduce them [9].

Figure 1 Linear regression between cross-validation and the
true error. Scatter plot and linear regression for cross-validation
(horizontal axis) and the true error (vertical axis) with sample size 50 for
linear discrimination between two classes of breast cancer patients.

RMS bounds 

Suppose a sample is collected, a classification rule $\Psi_n$
applied, and the classifier error estimated by an error-estimation
rule $\Xi_n$ to arrive at the classifier model
$(\psi_n, \hat{\varepsilon}[\psi_n])$. If no assumptions are posited regarding the
feature-label distribution, then the entire procedure is
completely distribution-free. There are three possibilities.
First, if no validity criterion is specified, then the classifier
model is ipso facto epistemologically meaningless.
Second, if a validity criterion is specified, say RMS, and
no distribution-free results are known about the RMS for
$\Psi_n$ and $\Xi_n$, then again the model is meaningless. Third,
if there exist distribution-free RMS bounds concerning
$\Psi_n$ and $\Xi_n$, then these bounds can, in principle, be used
to quantify the performance of the error estimator and
thereby quantify model validity.

Regarding the third possibility, the following is an exam- 
ple of a distribution-free RMS bound for the leave-one- 
out error estimator with the discrete histogram rule and 
tie-breaking in the direction of class 0 [8]: 

$$\mathrm{RMS}(\hat{\varepsilon}_{\mathrm{loo}} \mid F) \le \sqrt{\frac{1 + 6/e}{n} + \frac{6}{\sqrt{\pi (n - 1)}}}, \qquad (3)$$

where F is any feature-label distribution. Although this 
bound holds for all distributions, it is useless for small 
samples: for n = 200 this bound is 0.506. In general, there 
are very few cases in which distribution-free bounds are 
known and, when they are known, they are useless for 
small samples. 
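
To see how slowly this bound tightens, it can be evaluated directly. The sketch below takes the bound to be the square root of (1 + 6/e)/n + 6/sqrt(pi (n - 1)), a reading that reproduces the value 0.506 quoted above for n = 200; the function name is hypothetical.

```python
import math

def loo_histogram_rms_bound(n):
    """Distribution-free RMS bound for leave-one-out with the histogram rule."""
    first = (1.0 + 6.0 / math.e) / n
    second = 6.0 / math.sqrt(math.pi * (n - 1))
    return math.sqrt(first + second)

for n in (50, 200, 1000, 10000):
    print(n, round(loo_histogram_rms_bound(n), 3))
```

Because the second term decays only like n to the power -1/2 inside the square root, the bound itself decays like n to the power -1/4, so even very large samples leave it far above any useful RMS.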

Distribution-based bounds are needed. These require 
knowledge of the RMS, which means knowledge concern- 
ing the second-order moments of the joint distribution 
between the true and estimated errors. More generally, 
to fully understand an error estimator we need to know 
its joint distribution with the true error. Oddly, this prob- 
lem has historically been ignored in pattern recognition, 
notwithstanding the fact that error estimation is the epis- 
temological ground for classification. Going back to the 
1970s there were some results on the mean and variance of 
some error estimators for the Gaussian model using LDA. 
In 1966, Hills obtained the expected value of the resub- 
stitution and plug-in estimators in the univariate model 
with known common variance [12]. The resubstitution 
estimate is simply a count of the classification errors 
on the training data and the plug-in estimate is found 
by using the data to estimate the feature-label distribution
and then finding the error of the designed classifier
on the estimated distribution. In 1972, Foley obtained
the expected value of resubstitution in the multivari- 
ate model with known common covariance matrix [13]. 
In 1973, Sorum derived results for the expected value 
and variance for both resubstitution and leave-one-out 
in the univariate model with known common variance 
[14]. In 1973, McLachlan derived an asymptotic rep- 
resentation for the expected value of resubstitution in 
the multivariate model with unknown common covari- 
ance matrix [15]. In 1975, Moran obtained new results 
for the expected value of resubstitution and plug-in for 
the multivariate model with known covariance matrix 
[16]. In 1977, Goldstein and Wolf obtained the expected 
value of resubstitution for multinomial discrimination 
[17]. Following the latter, there was a gap of 15 years 
before Davison and Hall derived asymptotic represen- 
tations for the expected value and variance of boot- 
strap and leave-one-out in the univariate Gaussian model 
with unknown and possibly different covariances [18]. 
This is the only paper we know of providing analytic 
results for moments of common error estimators between 
1977 and 2005. None of these papers provided repre- 
sentation of the joint distribution or representation of 
second-order mixed moments, which are needed for 
the RMS. 

This problem has only recently been addressed begin- 
ning in 2005, in particular, for the resubstitution and 
leave-one-out estimators. For the multinomial model, 
complete enumeration was used to obtain the marginal 
distributions for the error estimators [11] and then the 
joint distributions [19]. Exact closed-form representa- 
tions for second-order moments, including the mixed 
moments, were obtained, thereby obtaining exact RMS 
representations for both estimators [11]. For the Gaussian 
model using LDA in 2009, we obtained the exact marginal 
distributions for both estimators in the univariate model 
(known but not necessarily equal class variances) and 
approximations in the multivariate model (known and 
equal class covariance matrices) [20]. Subsequently, these 
were extended to the joint distributions for the true 
and estimated errors in a Gaussian model [21]. Recently 
exact closed-form representations for the second-order 
moments in the univariate model without assuming equal 
covariances were discovered, thereby providing exact 
expression of the RMS for both estimators [22]. More- 
over, double asymptotic representations for the second- 
order moments in the multivariate model, sample size and 
dimension approaching infinity at a fixed rate between 
the two, were found, thereby providing double asymp- 
totic expressions for the RMS [23]. Finite sample approx- 
imations from the double asymptotic method have been 
shown to possess better accuracy than various simple 
asymptotic representations (although much more work is 
needed on this issue) [24,25]. 



Validity 

Let us now consider validity. An obvious way to proceed
would be to say that a classifier model $(\psi, \varepsilon_\psi)$ is valid
for the feature-label distribution $F$ to the extent that $\varepsilon_\psi$
approximates the classifier error, $\varepsilon[\psi]$, on $F$, where the
degree of approximation is measured by some distance
between $\varepsilon_\psi$ and $\varepsilon[\psi]$. For a classifier $\psi_n$ designed from a
specific sample, this would mean that we want to measure
some distance between $\varepsilon = \varepsilon[\psi_n]$ and $\hat{\varepsilon} = \hat{\varepsilon}[\psi_n]$, say
$|\hat{\varepsilon} - \varepsilon|$. To do this, we would have to know the true error,
and to know that we would need to know $F$. But if we knew
$F$, we would use the Bayes classifier and would not need to
design a classifier from sample data. Since it is the precision
of the error estimate that is of consequence, a natural
way to proceed would be to characterize validity in terms
of the precision of the error estimator $\hat{\varepsilon}[\psi_n] = \Xi_n(S_n)$
as an estimator of $\varepsilon[\psi_n]$, say by $\mathrm{RMS}(\hat{\varepsilon})$. This makes
sense because both the true and estimated errors are random
functions of the sample and the RMS measures their
closeness across the sampling distribution. But again there
is a catch: the RMS depends on $F$, which we do not know.
Thus, given the sample without knowledge of $F$, we cannot
compute the RMS.

To proceed, prior knowledge is required, in the sense 
that we need to assume that the actual (unknown) feature-label
distribution belongs to some uncertainty class, $\mathcal{U}$,
of feature-label distributions. Once RMS representations
have been obtained for feature-label distributions in
$\mathcal{U}$, distribution-based RMS bounds follow:
$\mathrm{RMS}(\hat{\varepsilon}) \le \max_{G \in \mathcal{U}} \mathrm{RMS}(\hat{\varepsilon} \mid G)$, where $\mathrm{RMS}(\hat{\varepsilon} \mid G)$ is the RMS of the
error estimator under the assumption that the feature-label
distribution is $G$. We do not know the actual feature-label
distribution precisely, but prior knowledge allows
us to bound the RMS. For instance, consider using LDA 
with a feature-label distribution having two equally prob- 
able Gaussian class-conditional densities sharing a known 
covariance matrix. For this model the Bayes error is a one- 
to-one decreasing function of the distance, m, between 
the means. Figure 2a shows the RMS to be a one-to-one 
increasing function of the Bayes error for leave-one-out 
for dimension D = 10 and sample sizes n = 20, 40, 60, 
the RMS and Bayes errors being on the y and x axes, 
respectively. 

Assuming a parameterized model in which the RMS is
an increasing function of the Bayes error, $\varepsilon_{\mathrm{Bayes}}$, we can
pose the following question: Given sample size $n$ and
$\lambda > 0$, what is the maximum value, maxBayes($\lambda$), of the
Bayes error such that $\mathrm{RMS}(\hat{\varepsilon}) \le \lambda$? If RMS is the measure
of validity and $\lambda$ represents the largest acceptable
RMS for the classifier model to be considered meaningful,
then the epistemological requirement is characterized
by maxBayes($\lambda$). Given the relationship between model
parameters and the Bayes error, the inequality
$\varepsilon_{\mathrm{Bayes}} \le$ maxBayes($\lambda$) can be solved in terms of the parameters to
arrive at a necessary modeling assumption. In the preceding
Gaussian example, since $\varepsilon_{\mathrm{Bayes}}$ is a decreasing function
of $m$, we obtain an inequality $m \ge m(\lambda)$. Figure 2b
shows the maxBayes($\lambda$) curves corresponding to the RMS
curves in Figure 2a [26]. These curves show that, assuming
Gaussian class-conditional densities and a known
common covariance matrix, further assumptions must be
made to ensure that the RMS is sufficiently small to make
the classifier model meaningful.

Figure 2 RMS and maxBayes($\lambda$). (a) RMS (y-axis) as a function of the Bayes error (x-axis) for leave-one-out with dimension D = 10 and sample
sizes n = 20 (plus sign), 40 (triangle), 60 (circle); (b) maxBayes($\lambda$) curves corresponding to the RMS curves in part (a).
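
The step from a bound on the Bayes error to a bound on the distance between the means can be made concrete for the Gaussian example. The sketch below assumes that the distance m is the Mahalanobis distance between the class means, in which case the Bayes error for two equally probable Gaussian classes with a common covariance matrix is the standard normal distribution function evaluated at -m/2. The value used for maxBayes is hypothetical, not one read from the paper's figure, and the function names are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution

def bayes_error_from_distance(m):
    """Bayes error for two equally likely Gaussian classes sharing a covariance
    matrix, where m is the (assumed) Mahalanobis distance between the means."""
    return _STD_NORMAL.cdf(-m / 2.0)

def min_distance_for(max_bayes):
    """Smallest distance m whose Bayes error does not exceed max_bayes."""
    return -2.0 * _STD_NORMAL.inv_cdf(max_bayes)

# Hypothetical value of maxBayes for some acceptable RMS level and sample size.
max_bayes = 0.10
m_required = min_distance_for(max_bayes)
print("required distance between means:", m_required)
print("Bayes error at that distance:   ", bayes_error_from_distance(m_required))
```

Under these assumptions, the requirement on the RMS turns into a concrete statement about how well separated the classes must be, which is the kind of modeling assumption the text calls necessary.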

To have scientific content, small-sample classification 
requires prior knowledge. Regarding the feature-label dis- 
tribution, there are two extremes: (1) the feature-label 
distribution is known, in which case the entire classifi- 
cation problem collapses to finding the Bayes classifier 
and Bayes error, so there is no classifier design or error 
estimation issue; and (2) the uncertainty class consists of 
all feature-label distributions, the distribution-free case, 
and we typically have no bound, or one that is too loose 
for practice. In the middle ground, there is a trade-off 
between the size of the uncertainty class and the size of 
the sample. The uncertainty class must be sufficiently con- 
strained (equivalently, the prior knowledge must be suf- 
ficiently great) that an acceptable bound can be achieved 
with an acceptable sample size. 

MMSE error estimation 

Given that one needs a distributional model to achieve 
useful performance bounds for classifier error estimation, 
an obvious course of action is to find or define a prior 
over the uncertainty class of feature-label distributions, 
and then find an optimal minimum-mean-square-error 
(MMSE) error estimator relative to that class [27]. This 
results in a Bayesian approach with the uncertainty class 
being given a prior distribution and the data being used 
to construct a posterior distribution, which quantifies 
everything we know about the feature-label distribution. 



Benefits of the Bayesian approach are (1) we can incorpo- 
rate prior knowledge in the whole classification procedure 
(classifier design and error estimation), which, as we have 
argued above, is desperately needed in a small-sample set- 
ting where the data provide only a meager amount of 
information; (2) given the mathematical framework, we 
can optimize each step of the procedure, further address- 
ing the poor performance suffered in small samples; and 
(3) we can obtain theoretical measures of the perfor- 
mance for both arbitrary classifiers (via the MMSE error 
estimator) and arbitrary error estimators (via the
sample-conditioned MSE), perhaps the most important advantage
epistemologically. We begin with an overview of optimal 
MMSE error estimation. 

Assume that a sample point has a prior probability $c$
of coming from class 0, and that the class-0 conditional
distribution is parameterized by $\theta_0$ and class 1 is parameterized
by $\theta_1$. Considering both classes, our model is
completely parameterized by $\theta = \{c, \theta_0, \theta_1\}$. Given a random
sample, $S_n$, we design a classifier $\psi_n$ and wish to
minimize the MSE between its true error, $\varepsilon$ (a function
of $\theta$ and $\psi_n$), and an error estimate, $\hat{\varepsilon}$ (a function of $S_n$
and $\psi_n$). A key realization is that the expectation in
the MSE may now be taken over the uncertainty class
conditioned on the observed sample, rather than over
the sampling distribution for a fixed (unknown) feature-label
distribution. The MMSE error estimator is thus the
expected true error, $\hat{\varepsilon}(\psi_n, S_n) = E_\theta[\varepsilon(\psi_n, \theta) \mid S_n]$. The
expectation given the sample is over the posterior density
of $\theta$, denoted by $\pi^*(\theta)$. Thus, we write the Bayesian
MMSE error estimator with the shorthand $\hat{\varepsilon} = E_{\pi^*}[\varepsilon]$.

The Bayesian error estimate is not guaranteed to be the
optimal error estimate for any particular feature-label
distribution; rather, it is optimal for a given sample and, assuming
the parameterized model and prior probabilities, it is both
optimal on average with respect to MSE and unbiased
when averaged over all parameters and samples. These
implications apply for any classification rule as long as the
classifier is fixed given the sample. To facilitate analytic
representations, we assume $c$, $\theta_0$ and $\theta_1$ are all mutually
independent prior to observing the data. Denote the
marginal priors of $c$, $\theta_0$ and $\theta_1$ by $\pi(c)$, $\pi(\theta_0)$ and $\pi(\theta_1)$,
respectively, and suppose data are used to find each posterior,
$\pi^*(c)$, $\pi^*(\theta_0)$ and $\pi^*(\theta_1)$, respectively. Independence
is preserved, i.e., $\pi^*(c, \theta_0, \theta_1) = \pi^*(c)\,\pi^*(\theta_0)\,\pi^*(\theta_1)$ [27].

If $\psi_n$ is a trained classifier given by $\psi_n(x) = 0$ if $x \in R_0$
and $\psi_n(x) = 1$ if $x \in R_1$, where $R_0$ and $R_1$ are measurable
sets partitioning the sample space, then the true error
of $\psi_n$ under the distribution parameterized by $\theta$ may be
decomposed as

$$
\begin{aligned}
\varepsilon(\psi_n, \theta) &= c \int_{R_1} f_{\theta_0}(x|0)\,dx + (1 - c) \int_{R_0} f_{\theta_1}(x|1)\,dx \\
&= c\,\varepsilon^0(\psi_n, \theta_0) + (1 - c)\,\varepsilon^1(\psi_n, \theta_1),
\end{aligned}
\tag{4}
$$

where $f_{\theta_y}(x|y)$ is the class-$y$ conditional density assuming
parameter $\theta_y$ is true and $\varepsilon^y$ is the error contributed by class
$y$. Owing to the posterior independence between $c$ and $\theta_0$
and between $c$ and $\theta_1$, the Bayesian MMSE error estimator
can be expressed as [28]

$$
\hat{\varepsilon}(\psi_n, S_n) = E_{\pi^*}[c]\,E_{\pi^*}[\varepsilon^0] + (1 - E_{\pi^*}[c])\,E_{\pi^*}[\varepsilon^1].
\tag{5}
$$

With a fixed sample and classifier, and given $\theta_y$, the true
error, $\varepsilon^y(\psi_n, \theta_y)$, is deterministic. Thus, letting $\Theta_y$ be the
parameter space of $\theta_y$,

$$
E_{\pi^*}[\varepsilon^y] = \int_{\Theta_y} \varepsilon^y(\psi_n, \theta_y)\,\pi^*(\theta_y)\,d\theta_y.
\tag{6}
$$

Just as the true error for a fixed feature-label distribution
is found from the class-conditional densities, $f_{\theta_y}(x|y)$, the
Bayesian MMSE error estimator for an uncertainty class
can be found from effective class-conditional densities,
which are derived by taking the expectations of the
individual class-conditional densities with respect to the
posterior distribution,

$$
f(x|y) = \int_{\Theta_y} f_{\theta_y}(x|y)\,\pi^*(\theta_y)\,d\theta_y.
\tag{7}
$$

Specifically, we obtain an equation for the expected true
error that parallels that of the true error in (4) [29]:

$$
\hat{\varepsilon}(\psi_n, S_n) = E_{\pi^*}[c] \int_{R_1} f(x|0)\,dx + (1 - E_{\pi^*}[c]) \int_{R_0} f(x|1)\,dx.
\tag{8}
$$
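
As a minimal sketch of how (6) through (8) fit together, the Python below uses a toy one-dimensional model that is an assumption made purely for illustration and is not the model of [30]: unit-variance Gaussian classes whose means have normal posteriors, and a threshold classifier. All function names and numerical values are hypothetical. It compares a Monte Carlo average of the true error over posterior draws with the value obtained from the effective densities, which in this toy model are again normal with inflated variance.

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def class_errors(t, mu0, mu1):
    """True class errors for the rule 'label 0 if x <= t, else 1' (unit variances)."""
    eps0 = 1.0 - phi(t - mu0)   # class-0 mass falling in R_1 = (t, inf)
    eps1 = phi(t - mu1)         # class-1 mass falling in R_0 = (-inf, t]
    return eps0, eps1

def bee_monte_carlo(t, ec, post0, post1, draws=20000, seed=0):
    """Average the true error over posterior draws of the means, as in (5) and (6).

    post0 and post1 are (mean, standard deviation) of the assumed normal posteriors.
    """
    rng = random.Random(seed)
    sum0 = sum1 = 0.0
    for _ in range(draws):
        e0, e1 = class_errors(t, rng.gauss(*post0), rng.gauss(*post1))
        sum0 += e0
        sum1 += e1
    return ec * sum0 / draws + (1.0 - ec) * sum1 / draws

def bee_effective(t, ec, post0, post1):
    """Same quantity from the effective densities N(m_y, 1 + s_y^2), as in (7) and (8)."""
    (m0, s0), (m1, s1) = post0, post1
    e0 = 1.0 - phi((t - m0) / math.sqrt(1.0 + s0 ** 2))
    e1 = phi((t - m1) / math.sqrt(1.0 + s1 ** 2))
    return ec * e0 + (1.0 - ec) * e1

# Hypothetical posterior: E[c] = 0.5, class means near 0 and 1, threshold at 0.5.
mc_value = bee_monte_carlo(0.5, 0.5, (0.0, 0.5), (1.0, 0.5))
closed_form = bee_effective(0.5, 0.5, (0.0, 0.5), (1.0, 0.5))
```

Under these assumptions the two values should agree up to Monte Carlo noise, which is the sense in which the effective densities carry all the information needed for the error estimate.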

Application of Bayesian error estimation to real data,
in particular gene-expression microarray data, has been
addressed in [30]. This work provides C code implementing
the Bayesian error estimator for Gaussian distributions
and normal-inverse-Wishart priors for both linear
classifiers, with exact closed-form representations, and
non-linear classifiers, where closed-form solutions are
not available and we instead implement a Monte Carlo
approximation. The code and a toolbox of related utilities
are publicly available. In [30] we discuss the suitability
of a Gaussian model with normal-inverse-Wishart pri- 
ors for microarray data and propose a feature selection 
scheme employing a Shapiro-Wilk Gaussianity test to val- 
idate Gaussian modeling assumptions. Furthermore, we 
propose a methodology for calibrating normal-inverse- 
Wishart priors for microarray data based on a method- 
of-moments approach using features discarded by the 
feature-selection scheme. 

Sample-conditioned MSE 

The RMS of an error estimator is used to characterize the 
validity of a classifier model. As we have discussed, if we 
are in possession of RMS expressions for the feature-label 
distributions in an uncertainty class, we can bound the
RMS, so as to ensure a given level of performance. In the
case of MMSE error estimation, the priors provide a math- 
ematical framework that can be used for both the anal- 
ysis of any error estimator and the design of estimators 
with desirable properties or optimal performance. The 
posteriors of the distribution parameters imply a (sample- 
conditioned) distribution on the true classifier error. This 
randomness in the true error comes from our uncer- 
tainty in the underlying feature-label distribution (given 
the sample). Within the assumed model, this sample- 
conditioned distribution of the true error contains the full 
information about error estimator accuracy and we may 
speak of moments of the true error (for a fixed sample 
and classifier), in particular the expectation, variance, and 
sample-conditioned MSE, as opposed to simply the MSE 
relative to the sampling distribution as in classical error 
estimation. 

Finding the sample-conditioned MSE of MMSE 
Bayesian error estimators amounts to evaluating the 
variance of the true error conditioned on the observed 
sample [28]. The sample-conditioned MSE converges to 
zero almost surely in both discrete and Gaussian models 
provided in [31], where closed form expressions for the 
MSE are available. Further, the exact MSE for arbitrary 
error estimators falls out naturally in the Bayesian model. 
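That is, if $\hat{\varepsilon}_\bullet$ is a constant representing an arbitrary error
estimate computed from the sample, then the MSE of $\hat{\varepsilon}_\bullet$
can be evaluated directly from that of the Bayesian error
estimator:

$$
\mathrm{MSE}(\hat{\varepsilon}_\bullet \mid S_n) = \mathrm{MSE}(\hat{\varepsilon} \mid S_n) + (\hat{\varepsilon} - \hat{\varepsilon}_\bullet)^2.
$$

$\mathrm{MSE}(\hat{\varepsilon}_\bullet \mid S_n)$, as well as its square root $\mathrm{RMS}(\hat{\varepsilon}_\bullet \mid S_n)$, is
minimized when $\hat{\varepsilon}_\bullet = \hat{\varepsilon}$.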
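
The decomposition above can be checked numerically. The following Python is a sketch only: it stands in for the sample-conditioned distribution of the true error with draws from a beta distribution, which is an arbitrary assumption for illustration (the actual distribution depends on the model and the sample), and the names are hypothetical.

```python
import random

def sample_conditioned_mse(estimate, true_error_draws):
    """Mean squared deviation of a constant estimate from draws of the true error."""
    return sum((e - estimate) ** 2 for e in true_error_draws) / len(true_error_draws)

rng = random.Random(1)
# Stand-in draws of the true error given the sample (assumed beta, for illustration).
draws = [rng.betavariate(3, 12) for _ in range(5000)]

bee = sum(draws) / len(draws)   # MMSE estimate: the mean of the true error
other = 0.35                    # an arbitrary competing estimate

lhs = sample_conditioned_mse(other, draws)
rhs = sample_conditioned_mse(bee, draws) + (bee - other) ** 2
```

Here `lhs` and `rhs` agree up to floating-point rounding, and no choice of `other` yields a smaller value than `sample_conditioned_mse(bee, draws)`.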

In a classical approach, nothing is known given a sam- 
ple, whereas in a Bayesian approach, the sample condi- 
tions uncertainty in the RMS and different samples may 
condition it to different extents. Figure 3 shows proba- 
bility densities of the sample-conditioned RMS for both 
the leave-one-out estimator and Bayesian error estimator 
in a discrete model with b = 16 bins. The simulation 
generates 10,000 distributions drawn from a prior given 
in [31] and 1,000 samples from each distribution. The 
unconditional RMS (averaged over both distributions and 
samples) for both error estimators is also shown, as well as 
the distribution-free RMS bound on leave-one-out given 
in (3). In Figure 3, the RMS of the Bayesian error estimator 
tends to be very close to 0.05 whereas the leave-one- 
out error estimator has a long tail with substantial mass 
between 0.05 and 0.2, demonstrating that different sam- 
ples can condition the RMS to a very significant extent. 
In addition, the unconditional RMS of the Bayesian error 
estimator is less than half that of leave-one-out, while 
Devroye's distribution-free bound on the unconditional 
RMS is too loose to be useful. Hence, not only does a 
Bayesian framework permit us to obtain an optimal error 
estimator and its RMS conditioned on the sample, but 
performance improvement can be significant. 

In [31], a bound on the sample-conditioned RMS of
the Bayesian error estimator is provided for the discrete
model. With any classifier, beta priors on $c$ and
Dirichlet priors on the bin probabilities satisfying mild
conditions, and given a sample $S_n$,
$\mathrm{RMS}(\hat{\varepsilon}_{\mathrm{BEE}} \mid S_n) \le 1/\sqrt{4n}$. For comparison, consider the holdout bound
$\mathrm{RMS}(\hat{\varepsilon}_{\mathrm{holdout}} \mid S_{n-m}, F) \le 1/\sqrt{4m}$, where $m$ is the size
of the test sample. Both bounds still hold if we remove
the conditioning, and in this way they become comparable.
Since $1/\sqrt{4n} < 1/\sqrt{4m}$, under a Bayesian model
not only does using the full sample to train the classifier
result in a lower true error, but we expect to achieve better
RMS performance using training-data error estimation
than we would by holding out the entire sample for error
estimation. This is a testament to the power of modeling.
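
To make the comparison concrete, the short Python sketch below evaluates both bounds for one hypothetical pair of sizes (n = 60 sample points, m = 20 held out); the sizes and the function name are illustrative assumptions, not values from the paper.

```python
import math

def rms_bound(points):
    """Upper bound 1/sqrt(4k) on the RMS, where k is the number of points in the bound."""
    if points <= 0:
        raise ValueError("points must be positive")
    return 1.0 / math.sqrt(4.0 * points)

n, m = 60, 20                  # hypothetical full-sample and holdout sizes
bayes_bound = rms_bound(n)     # about 0.065
holdout_bound = rms_bound(m)   # about 0.112
```

With these sizes the bound for the Bayesian error estimator is smaller by a factor of sqrt(n/m), here sqrt(3).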

Optimal classification 

Since prior knowledge is required to obtain a good
error estimate in small-sample settings, an obvious course
of action would be to utilize that knowledge for classifier
design [29,32]. Whereas ordinary Bayes classifiers
minimize the misclassification probability when the
underlying distributions are known, optimal Bayesian
classification trains a classifier from data assuming the
feature-label distribution is contained in a family parameterized
by $\theta \in \Theta$ with some assumed prior density
over the states. Formally, we define an optimal Bayesian
classifier, $\psi_{\mathrm{OBC}}$, as any classifier satisfying

$$
E_{\pi^*}[\varepsilon(\psi_{\mathrm{OBC}}, \theta)] \le E_{\pi^*}[\varepsilon(\psi, \theta)]
\tag{9}
$$

for all $\psi \in \mathcal{C}$, where $\mathcal{C}$ is an arbitrary family of classifiers.
Under the Bayesian framework, this is equivalent to
minimizing the probability of error as follows:

$$
\begin{aligned}
P(\psi_n(X) \ne Y \mid S_n) &= E_{\pi^*}[P(\psi_n(X) \ne Y \mid \theta, S_n)] \\
&= E_{\pi^*}[\varepsilon(\psi_n, \theta)] \\
&= \hat{\varepsilon}(\psi_n, S_n).
\end{aligned}
\tag{10}
$$

An optimal Bayesian classifier can be found by brute
force using the closed-form solutions for the expected
true error (the Bayesian error estimator), when available.
However, if $\mathcal{C}$ is the set of all classifiers (with measurable
decision regions), then an optimal Bayesian classifier can
be found analogously to Bayes classification for a fixed
distribution using the effective class-conditional densities.
To wit, we can realize an optimal solution without
explicitly finding the error for every classifier because the
solution can be found pointwise. Specifically, an optimal
Bayesian classifier, $\psi_{\mathrm{OBC}}$, satisfying (9) for all $\psi \in \mathcal{C}$,
the set of all classifiers with measurable decision regions,
exists and is given pointwise by [29]

$$
\psi_{\mathrm{OBC}}(x) =
\begin{cases}
0 & \text{if } E_{\pi^*}[c]\,f(x|0) \ge (1 - E_{\pi^*}[c])\,f(x|1), \\
1 & \text{otherwise.}
\end{cases}
\tag{11}
$$



If $E_{\pi^*}[c] = 0$, then this optimal Bayesian classifier is a
constant and always assigns class 1, and if $E_{\pi^*}[c] = 1$ it
always assigns class 0. Hence, we will typically assume that
$0 < E_{\pi^*}[c] < 1$.

Essentially, the optimal thing to do is to find the Bayes
classifier using $f(x|y)$ as the true class-conditional distributions.
This is like a plug-in rule, only $f(x|y)$ is not necessarily
in the family of distributions $\{f_{\theta_y}(x|y)\}$, but some
other kind of density that happens to result in the optimal
classifier. We find the optimal Bayesian classifier without
explicitly evaluating the expected true error, $E_{\pi^*}[\varepsilon(\psi, \theta)]$,
for every possible classifier $\psi$. With regard to both optimal
Bayesian classification and Bayesian MMSE error
estimation, $f(x|y)$ contains all of the necessary information
in the model about the class-conditional distributions
and we do not have to deal with the uncertainty class
or priors directly. Upon defining a model, we find $f(x|y)$
(which depends on the sample because it depends on
$\pi^*$) and then the whole problem is solved by treating
$f(x|y)$ as the true distribution: optimal classification, the
error estimate of the optimal classifier, and the optimal
error estimate for arbitrary classifiers. That being said,
there is no short-cut to finding the sample-conditioned
MSE via the effective density; indeed, there is no notion
of variance in the true error of a fixed classifier under
the effective class-conditional densities. Moreover, the
approach of using the effective class-conditional densities
finds an optimal Bayesian classifier over all possible
classifiers. On the other hand, there may be advantages
to restricting the space of classifiers; for example,
in a Gaussian model one may prefer linear classifiers
where closed-form Bayesian error estimators have been
found [33].
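
The pointwise rule in (11) is simple to express in code once the effective densities are available. The Python below is a sketch under stated assumptions: the two normal densities merely stand in for $f(x|0)$ and $f(x|1)$ and are not derived from any model in the paper, the names are hypothetical, and ties are assigned to class 0 so that the rule agrees with the discrete form in (17).

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def obc_label(x, ec, f0, f1):
    """Pointwise rule (11): label 0 when E[c] f(x|0) >= (1 - E[c]) f(x|1), else 1."""
    return 0 if ec * f0(x) >= (1.0 - ec) * f1(x) else 1

def f0(x):
    """Stand-in effective density for class 0 (assumed, for illustration)."""
    return normal_pdf(x, 0.0, 1.25)

def f1(x):
    """Stand-in effective density for class 1 (assumed, for illustration)."""
    return normal_pdf(x, 1.0, 1.25)

labels = [obc_label(x, 0.5, f0, f1) for x in (-1.0, 0.0, 1.0, 2.0)]
```

Setting `ec` to 0 or 1 reproduces the degenerate constant classifiers described above.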

We will present a Bayesian MMSE classifier for the 
discrete model, which has already been solved. More 
generally, what we are proposing is not just a few new 
classifiers, but a new paradigm in classifier design focused 
on optimization over a concrete mathematical framework. 
Furthermore, this work ties Bayesian modeling and the 
Bayesian error estimator together with the old problem of 
optimal robust filtering; indeed, in the absence of obser- 
vations, the optimal Bayesian classifier reduces to the 
Bayesian robust optimal classifier [32,34]. 

Optimal discrete classification 

To illustrate concepts in optimal Bayesian classification,
we consider discrete classification, in which the sample
space is discrete with $b$ bins. We let $p_i$ and $q_i$ be the
class-conditional probabilities in bin $i \in \{1, \ldots, b\}$ for
class 0 and 1, respectively, and we define $U_i$ and $V_i$ to
be the number of sample points observed in bin $i \in
\{1, \ldots, b\}$ from class 0 and 1, respectively. The class sizes
are given by $n_0 = \sum_{i=1}^{b} U_i$ and $n_1 = \sum_{i=1}^{b} V_i$. A
general discrete classifier assigns each bin to a class, so
$\psi_n : \{1, \ldots, b\} \to \{0, 1\}$.

The discrete Bayesian model defines $\theta_0 = [p_1, \ldots, p_{b-1}]$
and $\theta_1 = [q_1, \ldots, q_{b-1}]$. The last bin probabilities are not
needed since $p_b = 1 - \sum_{i=1}^{b-1} p_i$ and $q_b = 1 - \sum_{i=1}^{b-1} q_i$.



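The parameter space of $\theta_0$ is defined to be the set of all valid
bin probabilities, i.e., $[p_1, \ldots, p_{b-1}] \in \Theta_0$ if and only if
$0 \le p_i \le 1$ for $i \in \{1, \ldots, b-1\}$ and $\sum_{i=1}^{b-1} p_i \le 1$. The
parameter space $\Theta_1$ is defined similarly. With the parametric
model established, we define conjugate Dirichlet
priors

$$
\pi(\theta_0) \propto \prod_{i=1}^{b} p_i^{\alpha_i^0 - 1}
\quad \text{and} \quad
\pi(\theta_1) \propto \prod_{i=1}^{b} q_i^{\alpha_i^1 - 1}.
\tag{12}
$$

For proper priors, the hyperparameters, $\alpha_i^y$ for $i \in
\{1, \ldots, b\}$ and $y \in \{0, 1\}$, must be positive, and for uniform
priors $\alpha_i^y = 1$ for all $i$ and $y$. In this setting, the posteriors
are again Dirichlet, and when normalized they are given
by

$$
\pi^*(\theta_0) = \frac{\Gamma\!\left(n_0 + \sum_{i=1}^{b} \alpha_i^0\right)}{\prod_{k=1}^{b} \Gamma\!\left(U_k + \alpha_k^0\right)} \prod_{i=1}^{b} p_i^{U_i + \alpha_i^0 - 1},
\tag{13}
$$

$$
\pi^*(\theta_1) = \frac{\Gamma\!\left(n_1 + \sum_{i=1}^{b} \alpha_i^1\right)}{\prod_{k=1}^{b} \Gamma\!\left(V_k + \alpha_k^1\right)} \prod_{i=1}^{b} q_i^{V_i + \alpha_i^1 - 1},
\tag{14}
$$

where $\Gamma$ is the Gamma function.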
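
The conjugate update behind (13) and (14) amounts to adding the observed bin counts to the prior hyperparameters. The Python below is a minimal sketch of that update and of evaluating the normalized density; the function names are hypothetical, and the density routine assumes strictly positive bin probabilities.

```python
import math

def dirichlet_posterior_params(alpha, counts):
    """Posterior hyperparameters: prior alpha_i plus the count observed in bin i."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have one entry per bin")
    if any(a <= 0 for a in alpha):
        raise ValueError("proper priors need positive hyperparameters")
    return [a + k for a, k in zip(alpha, counts)]

def dirichlet_pdf(probs, params):
    """Normalized Dirichlet density, the form of (13) and (14), at a full
    vector of b strictly positive bin probabilities summing to one."""
    log_norm = math.lgamma(sum(params)) - sum(math.lgamma(a) for a in params)
    log_kernel = sum((a - 1.0) * math.log(p) for a, p in zip(params, probs))
    return math.exp(log_norm + log_kernel)

# Hypothetical example: b = 2 bins, uniform prior, one class-0 point seen in bin 1.
params = dirichlet_posterior_params([1.0, 1.0], [1, 0])
density = dirichlet_pdf([0.25, 0.75], params)
```

In this two-bin example the posterior density of $p_1$ is $2p_1$, so the value at $p_1 = 0.25$ is 0.5.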

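In the discrete model, for $j \in \{1, \ldots, b\}$ the effective
class-conditional densities can be shown to be equal to

$$
f(j|0) = \frac{U_j + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0}
\quad \text{and} \quad
f(j|1) = \frac{V_j + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}.
\tag{15}
$$

$f(j|0)$ and $f(j|1)$ may be viewed as effective bin probabilities
for each class after combining prior knowledge and
observed data. Hence, from (8), the Bayesian MMSE error
estimator for an arbitrary classifier $\psi_n$ is

$$
\hat{\varepsilon} = \sum_{j=1}^{b} \left[ E_{\pi^*}[c]\, \frac{U_j + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0}\, I_{\psi_n(j) = 1}
+ (1 - E_{\pi^*}[c])\, \frac{V_j + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}\, I_{\psi_n(j) = 0} \right],
\tag{16}
$$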



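where $I_E$ is an indicator function equal to one if $E$ is
true and zero otherwise. Exactly the same expression was
derived using a brute-force approach in [27]. The optimal
Bayesian classifier may now be found directly using (11):

$$
\psi_{\mathrm{OBC}}(j) =
\begin{cases}
1 & \text{if } E_{\pi^*}[c]\, \dfrac{U_j + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0} < (1 - E_{\pi^*}[c])\, \dfrac{V_j + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1}, \\
0 & \text{otherwise.}
\end{cases}
\tag{17}
$$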



The optimal Bayesian classifier minimizes the Bayesian
error estimator by minimizing each term in the sum (16).
This is achieved by assigning $\psi_{\mathrm{OBC}}(j)$ the class with
the smaller constant scaling the indicator function. The
expected error of the optimal classifier is

$$
\hat{\varepsilon}_{\mathrm{OBC}} = \sum_{j=1}^{b} \min\left\{ E_{\pi^*}[c]\, \frac{U_j + \alpha_j^0}{n_0 + \sum_{i=1}^{b} \alpha_i^0},\;
(1 - E_{\pi^*}[c])\, \frac{V_j + \alpha_j^1}{n_1 + \sum_{i=1}^{b} \alpha_i^1} \right\}.
\tag{18}
$$
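
Equations (15) through (18) translate almost line for line into code. The Python below is a sketch, not the authors' implementation: the function names are hypothetical, bins are indexed from zero, `ec` denotes the posterior mean $E_{\pi^*}[c]$ supplied by the caller, and the counts in the example are invented.

```python
def effective_bins(counts, alpha):
    """Effective bin probabilities (15): (count_j + alpha_j) / (n_y + sum_i alpha_i)."""
    total = sum(counts) + sum(alpha)
    return [(k + a) / total for k, a in zip(counts, alpha)]

def bayesian_error_estimate(labels, U, V, alpha0, alpha1, ec):
    """Bayesian MMSE error estimate (16) for a classifier given as bin labels in {0, 1}."""
    f0, f1 = effective_bins(U, alpha0), effective_bins(V, alpha1)
    return sum(ec * f0[j] if lab == 1 else (1.0 - ec) * f1[j]
               for j, lab in enumerate(labels))

def optimal_bayesian_classifier(U, V, alpha0, alpha1, ec):
    """Bin labels from (17) and the expected error (18); ties go to class 0."""
    f0, f1 = effective_bins(U, alpha0), effective_bins(V, alpha1)
    labels = [1 if ec * a < (1.0 - ec) * b else 0 for a, b in zip(f0, f1)]
    error = sum(min(ec * a, (1.0 - ec) * b) for a, b in zip(f0, f1))
    return labels, error

# Hypothetical sample with b = 4 bins, uniform Dirichlet priors, and E[c] = 0.5.
U, V = [3, 1, 0, 1], [0, 1, 3, 1]
ones = [1.0, 1.0, 1.0, 1.0]
obc_labels, obc_error = optimal_bayesian_classifier(U, V, ones, ones, 0.5)
check = bayesian_error_estimate(obc_labels, U, V, ones, ones, 0.5)
```

For these counts the rule labels only the third bin as class 1, and evaluating (16) at those labels returns the same value as (18), namely 1/3.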



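In the special case where we have uniform $c$ and uniform
priors for the bin probabilities ($\alpha_i^y = 1$ for all $i$ and $y$), the
Bayesian MMSE error estimate is

$$
\hat{\varepsilon} = \sum_{j=1}^{b} \left[ \frac{n_0 + 1}{n + 2} \cdot \frac{U_j + 1}{n_0 + b}\, I_{\psi_n(j) = 1}
+ \frac{n_1 + 1}{n + 2} \cdot \frac{V_j + 1}{n_1 + b}\, I_{\psi_n(j) = 0} \right],
\tag{19}
$$

the optimal Bayesian classifier is

$$
\psi_{\mathrm{OBC}}(j) =
\begin{cases}
1 & \text{if } \dfrac{n_0 + 1}{n_0 + b}\,(U_j + 1) < \dfrac{n_1 + 1}{n_1 + b}\,(V_j + 1), \\
0 & \text{otherwise,}
\end{cases}
\tag{20}
$$

and the expected error of the optimal classifier is

$$
\hat{\varepsilon}_{\mathrm{OBC}} = \sum_{j=1}^{b} \min\left\{ \frac{n_0 + 1}{n + 2} \cdot \frac{U_j + 1}{n_0 + b},\;
\frac{n_1 + 1}{n + 2} \cdot \frac{V_j + 1}{n_1 + b} \right\}.
\tag{21}
$$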



Hence, under uniform priors, when the total number
of samples observed in each class is the same ($n_0 = n_1$),
the optimal Bayesian classifier is equivalent to the classical
discrete histogram rule, which assigns a class to each
bin by a majority vote: $\psi_{\mathrm{DHR}}(j) = 1$ if $U_j < V_j$ and
$\psi_{\mathrm{DHR}}(j) = 0$ if $U_j \ge V_j$; otherwise, the discrete histogram
rule is not necessarily optimal within an arbitrary Bayesian
framework.
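
The equivalence under balanced class sizes, and its failure otherwise, can be seen on small examples. The Python below is an illustrative sketch with hypothetical function names and invented counts; it implements rule (20) and the histogram rule with ties assigned to class 0.

```python
def obc_uniform(U, V):
    """Optimal Bayesian classifier (20) under uniform priors on c and the bins."""
    b = len(U)
    n0, n1 = sum(U), sum(V)
    w0 = (n0 + 1) / (n0 + b)
    w1 = (n1 + 1) / (n1 + b)
    return [1 if w0 * (u + 1) < w1 * (v + 1) else 0 for u, v in zip(U, V)]

def histogram_rule(U, V):
    """Discrete histogram rule: majority vote per bin, ties to class 0."""
    return [1 if u < v else 0 for u, v in zip(U, V)]

# Balanced classes (n0 = n1 = 5): the two rules coincide.
balanced = ([3, 1, 0, 1], [0, 1, 3, 1])
same = obc_uniform(*balanced) == histogram_rule(*balanced)

# Unbalanced classes (n0 = 2, n1 = 12): the rules differ in the first bin,
# where the counts tie but the weights in (20) favor class 1.
unbalanced = ([1, 1, 0, 0], [1, 5, 3, 3])
obc_labels = obc_uniform(*unbalanced)
dhr_labels = histogram_rule(*unbalanced)
```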

We take a moment to compare optimal Bayesian classification
over an uncertainty class of distributions with
Bayes classification for a fixed feature-label distribution.
With fixed class-0 probability $c$ and bin probabilities $p_j$
and $q_j$, the true error of an arbitrary classifier, $\psi$, is given
by

$$
\varepsilon = \sum_{j=1}^{b} \left[ c\,p_j\, I_{\psi(j) = 1} + (1 - c)\,q_j\, I_{\psi(j) = 0} \right].
\tag{22}
$$

Note a similarity to (16) and (19). The Bayes classifier is
given by $\psi_{\mathrm{Bayes}}(j) = 1$ if $c\,p_j < (1 - c)\,q_j$ and zero otherwise,
corresponding to (17) and (20). Finally, the Bayes error is
given by

$$
\varepsilon_{\mathrm{Bayes}} = \sum_{j=1}^{b} \min\{ c\,p_j,\; (1 - c)\,q_j \},
\tag{23}
$$

corresponding to (18) and (21). Throughout, $c$ corresponds
to $E_{\pi^*}[c]$, $p_j$ corresponds to the effective bin probability
$f(j|0) = (U_j + \alpha_j^0)/(n_0 + \sum_{i=1}^{b} \alpha_i^0)$ and similarly $q_j$
corresponds to the effective bin probability $f(j|1)$. In this
case, the effective density is a member of our uncertainty
class (which contains all possible discrete feature-label
distributions), so that the optimal thing to do is simply
to plug the effective parameters into the fixed-distribution
problem.
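
For a fixed distribution, (22) and (23) can be sketched in a few lines of Python. The names and the example probabilities are hypothetical; replacing `c`, `p` and `q` by $E_{\pi^*}[c]$ and the effective bin probabilities would give the Bayesian quantities discussed above.

```python
def true_error(labels, c, p, q):
    """True error (22) of a discrete classifier given as bin labels in {0, 1}."""
    return sum(c * p[j] if lab == 1 else (1.0 - c) * q[j]
               for j, lab in enumerate(labels))

def bayes_classifier(c, p, q):
    """Bayes classifier: label 1 where c p_j < (1 - c) q_j, else 0."""
    return [1 if c * pj < (1.0 - c) * qj else 0 for pj, qj in zip(p, q)]

def bayes_error(c, p, q):
    """Bayes error (23): sum over bins of the smaller weighted bin probability."""
    return sum(min(c * pj, (1.0 - c) * qj) for pj, qj in zip(p, q))

# Hypothetical fixed distribution with b = 4 bins.
c = 0.5
p = [0.5, 0.3, 0.1, 0.1]
q = [0.1, 0.1, 0.3, 0.5]
labels = bayes_classifier(c, p, q)
```

Evaluating `true_error` at the Bayes labels reproduces `bayes_error`, and any other labeling gives a value at least as large.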

That being said, the effective density is not always a
member of our uncertainty class. Consider an example
with $D = 2$ features, an uncertainty class of Gaussian
class-conditional distributions with independent arbitrary
covariances, and a proper posterior with fixed class-0
probability $c = 0.5$ (hyperparameters are provided in
[32]). We consider three classifiers. First is a plug-in
classifier, $\psi_{\text{plug-in}}$, which is the Bayes classifier corresponding to the
posterior expected parameters, $c = 0.5$, $\mu_0 = [0, 0, \ldots, 0]$,
$\mu_1 = [1, 1, \ldots, 1]$, and $\Sigma_0 = \Sigma_1 = I_D$. Since the expected
covariances are homoscedastic, this classifier is linear.
The second is a state-constrained optimal Bayesian
classifier, $\psi_{\mathrm{SCOBC}}$, in which we search for a state with
corresponding Bayes classifier having smallest expected error
over the uncertainty class [34]. Since the Bayes classifier
for any particular state in the uncertainty class is
quadratic, this classifier is quadratic. Finally, we have the
optimal Bayesian classifier, $\psi_{\mathrm{OBC}}$, which has been solved
analytically in [29], although details are omitted here. In this
case, the effective densities are not Gaussian but multivariate
Student's t distributions, resulting in an optimal
Bayesian classifier having a polynomial decision boundary
that is higher than quadratic order. Figure 4 shows
$\psi_{\text{plug-in}}$ (red), $\psi_{\mathrm{SCOBC}}$ (black) and $\psi_{\mathrm{OBC}}$ (green). Level
curves for the class-conditional distributions corresponding
to the expected parameters used in $\psi_{\text{plug-in}}$ are shown
in red dashed lines, and level curves for the distributions
in the state corresponding to $\psi_{\mathrm{SCOBC}}$ are shown
in black dashed lines. These were found by setting the
Mahalanobis distance to 1. Each classifier is quite distinct,
and in particular, the optimal Bayesian classifier is
non-quadratic even though all class-conditional distributions
in the uncertainty class are Gaussian.

Figure 4 Classifiers for an independent arbitrary covariance
Gaussian model with D = 2 features and proper posteriors. Whereas
the optimal Bayesian classifier (in green) is polynomial with expected
true error 0.2007, the state-constrained optimal Bayesian classifier (in
black) is quadratic with expected true error 0.2061 and the plug-in
classifier (in red) is linear with expected true error 0.2078. These
expected true errors are averaged over the posterior on the
uncertainty class of states.

To demonstrate the performance advantage of optimal
Bayesian classification via a simulated experiment, we
return to the discrete classification problem. Let $c$ and
the bin probabilities be generated randomly according to
uniform prior distributions. For each fixed feature-label
distribution, a binomial($n$, $c$) experiment is used to determine
the number of sample points in class 0 and the bin
for each point is drawn according to the bin probabilities
corresponding to its class, thus generating a non-stratified
random sample of size $n$. Both the histogram rule and
the new optimal Bayesian classifier from (20), assuming
correct priors, are trained from the sample. The true error
for each classifier is also calculated exactly via (22). This
is repeated 100,000 times to obtain the average true error
for each classification rule, presented in Figure 5 for $b = 2$,
4 and 8 bins. Observe that the average performance of
optimal Bayesian classification is indeed superior to that
of the discrete histogram rule, especially for larger numbers
of bins. However, note that optimal Bayesian classifiers are
not guaranteed to be optimal for a specific distribution
(the optimal classifier is the Bayes classifier), but only optimal
when averaged over all distributions in the assumed
Bayesian framework.
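
The experiment just described can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions rather than the code used for Figure 5: the function names are hypothetical, the repetition count is far smaller than 100,000, and drawing each point's class independently with probability $c$ is used as an equivalent way of realizing the binomial($n$, $c$) class sizes.

```python
import random

def uniform_dirichlet(rng, b):
    """One draw of b bin probabilities from a uniform Dirichlet prior."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(b)]
    total = sum(g)
    return [x / total for x in g]

def true_error(labels, c, p, q):
    """Exact true error (22) of a discrete classifier."""
    return sum(c * p[j] if lab == 1 else (1.0 - c) * q[j]
               for j, lab in enumerate(labels))

def histogram_rule(U, V):
    """Majority vote per bin, ties to class 0."""
    return [1 if u < v else 0 for u, v in zip(U, V)]

def obc_uniform(U, V):
    """Optimal Bayesian classifier (20) under uniform priors."""
    b, n0, n1 = len(U), sum(U), sum(V)
    w0, w1 = (n0 + 1) / (n0 + b), (n1 + 1) / (n1 + b)
    return [1 if w0 * (u + 1) < w1 * (v + 1) else 0 for u, v in zip(U, V)]

def simulate(b, n, reps, seed=0):
    """Average true errors of the histogram rule and the OBC over random models."""
    rng = random.Random(seed)
    total_dhr = total_obc = 0.0
    for _ in range(reps):
        c = rng.random()                       # uniform prior on c
        p, q = uniform_dirichlet(rng, b), uniform_dirichlet(rng, b)
        U, V = [0] * b, [0] * b
        for _ in range(n):                     # non-stratified sample of size n
            if rng.random() < c:
                U[rng.choices(range(b), weights=p)[0]] += 1
            else:
                V[rng.choices(range(b), weights=q)[0]] += 1
        total_dhr += true_error(histogram_rule(U, V), c, p, q)
        total_obc += true_error(obc_uniform(U, V), c, p, q)
    return total_dhr / reps, total_obc / reps

avg_dhr, avg_obc = simulate(b=8, n=20, reps=2000)
```

Because the priors assumed by `obc_uniform` match those used to generate the models, its average true error should not exceed that of the histogram rule beyond Monte Carlo noise.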

Conclusions 

Scientific knowledge is possible for small-sample 
classification. 

Given the importance of classification throughout sci- 
ence and the crucial epistemological role played by error 
estimation, it is remarkable that only one paper providing 
analytic results for moments of common error estima- 
tors was published between 1977 and 2005, and that up 
until 2005, there were no papers providing representa- 
tion of the joint distribution or of the second-order mixed 
moments. Today, we are paying the price for this dearth 
of activity as we are now presented with very large fea- 
ture sets and small samples across different disciplines, in 
particular, in high-throughput biology, where the advance 
of medical science is being hamstrung by a lack of basic 
knowledge regarding pattern recognition. Moreover, in 
spite of this obvious crippling lack of knowledge, there is 
only a minuscule effort to rectify the situation, whereas 
billions of dollars are wasted on gathering an untold quan- 
tity of data that is useless absent the requisite statistical 
knowledge to make it useful. 

No doubt this unfortunate situation would make for a 
good sociological study. But that is not our field of exper- 
tise. Nonetheless, we will put forth a comment made by 
Thomas Kailath in 1974, about the time that fundamen- 
tal research in error estimation for small-sample classi- 
fication came to a halt. He writes, "It was the peculiar 
atmosphere of the sixties, with its catchwords of 'building 
research competence,' 'training more scientists,' etc., that 
supported the uncritical growth of a literature in which 
quantity and formal novelty were often prized over significance
and attention to scholarship. There was little
concern for fitting new results into the body of old ones;
it was important to have 'new' results!" [35]. Although
Kailath's observation was aimed at signal processing, the
"peculiar atmosphere" of which he speaks is not limited
to any particular discipline; rather, he had perceived an
"uncritical growth of a literature" lacking "attention to
scholarship." One can only wonder what Prof. Kailath's
thoughts are today when he surveys a research landscape 
that produces orders of magnitude more papers but pro- 
duces less knowledge than that produced by the relative 
handful of scientists, statisticians, and engineers a half 
century ago. For those who would question this latter 
observation in pattern recognition, we suggest a study of 
the early papers by such pioneers as Theodore Anderson, 
Albert Bowker, and Rosedith Sitgreaves.